[SPARK-19602][SQL] Support column resolution of fully qualified column name (3 part name) #17185
skambha wants to merge 7 commits into apache:master
Conversation
cc @gatorsmile, @cloud-fan I'd really appreciate your review and comments. Thanks much.
I'm not sure this is a valid case. We should treat SELECT ... FROM ... as a two-level lookup: at level 1, in the FROM clause, we specify the table name with an optional qualifier (the database name); at level 2, in the SELECT clause, we specify the column name with an optional qualifier (the table name).
I don't think it makes sense to specify the database name for a column name; the table lookup should be done in the FROM clause.
Yes, but @skambha did a comparison across different databases. Basically, what she tried to do is make Spark SQL consistent with the others. See the design doc:
https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf
Thanks @gatorsmile and @cloud-fan.
Hi @cloud-fan,
The use case is to support fully qualified column names in queries, to remove ambiguity in naming.
For example: select db2.t1.i1 from db1.t1, db2.t1 where db1.t1.i2 = db2.t1.i3
Today Spark does not support this; the workaround is to create an alias for the table.
Yes, the lookup for the table comes from the FROM clause. At a high level, the changes include keeping track of the qualifier (including the database name) when the lookup happens, and enhancing the column resolution logic in the analyzer to resolve a fully qualified column name from a logical plan node's children.
There are more details in the design doc attached to SPARK-19602.
Thanks.
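To make the use case concrete, here is today's alias workaround next to the proposed syntax. The tables and columns are taken from the example above; the alias names a and b are arbitrary:

```sql
-- Workaround today: disambiguate via table aliases
SELECT b.i1 FROM db1.t1 AS a, db2.t1 AS b WHERE a.i2 = b.i3;

-- With this patch: fully qualified column names, no aliases needed
SELECT db2.t1.i1 FROM db1.t1, db2.t1 WHERE db1.t1.i2 = db2.t1.i3;
```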
I agree it's a valid use case, do you wanna bring it up to date? Sorry for the delay!
Sure. Let me look into it. Thanks.
I think we can just use Seq[String].
I followed the style that exists today. If we change it here, there are other places that need to change as well. For example, in AttributeReference we initialize it with None today, and it would become Seq.empty. I am not sure if we want to create these objects or just leave it as None by keeping the Option[].
If you have a strong preference, I can update it. I have the changes locally but have not pushed that version. Thanks.
I have rebased and pushed the changes. I ran the unit tests (sql, catalyst and hive). Please take a look. Thanks.
ok to test
Test build #86899 has finished for PR 17185 at commit
Any plans of merging this PR?
@mafernandez-stratio We plan to review and merge after the 2.3 release. The target release will be 2.4.
@gatorsmile Great, thanks for the feedback!
ok to test
Test build #93810 has finished for PR 17185 at commit
what about SELECT db1.t1.* FROM t1 while the current database is db1?
This (SELECT db1.t1.* FROM t1) would resolve correctly.
A similar scenario (details in Section 2, Table A #5 in the design doc): select db1.t1.i1 from t1 will resolve correctly when the current database is db1. This is tested here.
I will explicitly add a test case for SELECT db1.t1.* FROM t1 as well.
I think we can make SubqueryAlias take a qualifier: Seq[String] instead of alias: String.
Here, SubqueryAlias takes an AliasIdentifier and not a String.
I also want to mention that we considered something similar to your suggestion: simply adding a qualifier as a Seq[String] member of SubqueryAlias, but that change would require touching more files. I had implemented that first (a long time back now), and after some offline discussions we favored this approach as it is more localized and encapsulated. @gatorsmile @cloud-fan
Again, I feel using Seq[String] directly can simplify a lot of code.
I'll look into this.
I took care of this change in the recent push. Thanks.
can you briefly explain the new resolution logic? I feel it's a little convoluted now, as we are more likely to hit ambiguity.
The resolution logic needs to compare against the full qualifier. If you are trying to resolve db1.t1.i1, the match needs to happen against the entire qualifier of the attribute.
For example, take select db1.t1.i1 from t1, where the current database is db1 and t1 has columns i1 and i2. When resolving db1.t1.i1, the input attributes are:
- AttributeReference (qualifier Seq(db1, t1), name i1)
- AttributeReference (qualifier Seq(db1, t1), name i2)
So the new resolution logic will match the entire qualifier sequence and then match the column (i.e. the attribute name).
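A minimal, self-contained sketch of the matching rule described above. Attr and resolver are simplified stand-ins for illustration, not the actual Spark types:

```scala
// Simplified stand-in for AttributeReference: a qualifier plus a column name.
case class Attr(qualifier: Seq[String], name: String)

// Case-insensitive resolver, as in Spark's default (case-insensitive) analysis.
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

// nameParts is e.g. Seq("db1", "t1", "i1"): the last element is the column
// name; the preceding parts must match a suffix of the attribute's qualifier.
def matchesAttr(nameParts: Seq[String], attr: Attr): Boolean = {
  val qualifierParts = nameParts.dropRight(1)
  resolver(nameParts.last, attr.name) &&
    qualifierParts.length <= attr.qualifier.length &&
    attr.qualifier.takeRight(qualifierParts.length)
      .corresponds(qualifierParts)(resolver)
}

matchesAttr(Seq("db1", "t1", "i1"), Attr(Seq("db1", "t1"), "i1"))  // true
matchesAttr(Seq("t1", "i1"),        Attr(Seq("db1", "t1"), "i1"))  // true
matchesAttr(Seq("i1"),              Attr(Seq("db1", "t1"), "i1"))  // true
matchesAttr(Seq("db2", "t1", "i1"), Attr(Seq("db1", "t1"), "i1"))  // false
```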
At a high level, the changes include:
- The column resolution needs to account for the qualifier, which is now a sequence of strings.
- Enhance lookupRelation to add the qualifier appropriately. Right now, the LogicalPlan leaf nodes representing a relation use only the table name as the qualifier when no alias is provided. This needs to be updated to add both the database name and the table name for permanent views, tables, and global temporary views.
Full details of the design and the resolution logic are in Sections 3.3 and 3.4 of the doc. Section 2 of the design doc has many scenarios that were tested against different databases and Spark, along with the expected behavior under this design.
If there is a specific scenario you have in mind that would be valid but gets flagged as ambiguous, please let me know. I'd be glad to look into it.
@gatorsmile, @cloud-fan, just a quick comment: I have been working on this and will respond soon.
I rebased and found that the resolution code in LogicalPlan has changed and now uses a map lookup to do the matching. I have some ideas on how to incorporate the 3 part name into the map lookup logic. For now, I have synced up, bypassed the new map logic, and pushed the code so it is up to date and I can get the test cycle from GitHub; I added TODOs for myself to incorporate the optimized lookup logic. @cloud-fan, in the meantime, if you have any other comments please let me know. Thanks.
Test build #93904 has finished for PR 17185 at commit
do we assume the attribute always carries the full qualifier? i.e. will the analyzer always include the current database in the qualifier?
The database information gets carried over based on the lookupRelation logic. Details are in Section 3.3.3 in the design doc.
it's weird to use pattern matching like this, maybe better to write:
nameParts.corresponds(qualifierList)(resolver) || {
  if (nameParts.length == 1 && qualifierList.nonEmpty) {
    resolver(nameParts.head, qualifierList.last)
  } else {
    false
  }
}
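For reference, Scala's `corresponds` requires equal lengths plus a pairwise match, which is why the extra single-name-part branch is needed. A standalone illustration, not the PR's code:

```scala
// A case-insensitive resolver, for illustration.
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

Seq("db1", "t1").corresponds(Seq("DB1", "T1"))(resolver)  // true: pairwise match
Seq("t1").corresponds(Seq("db1", "t1"))(resolver)         // false: lengths differ
// Hence the fallback branch, which compares a lone name part
// against the last element of the qualifier list.
```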
IIRC users can do select structCol.* from tbl; does that mean nameParts.head may not be a table name?
- a seq of 2 elements: database name and table name.
overall LGTM, my major concern is how to do O(1) lookup for the 3 part name
Thanks @cloud-fan for the review. I am working on implementing an idea to get an optimized lookup with the 3 part name.
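One plausible shape for such an optimized lookup, sketched under simplified assumptions (Attr and all names here are hypothetical, not the actual Spark code): key a map by column name so resolution is a single map hit plus a qualifier check.

```scala
// Simplified stand-in for AttributeReference.
case class Attr(qualifier: Seq[String], name: String)

val attrs = Seq(Attr(Seq("db1", "t1"), "i1"), Attr(Seq("db1", "t1"), "i2"))

// Build the index once per plan node: column name -> candidate attributes.
val byName: Map[String, Seq[Attr]] = attrs.groupBy(_.name.toLowerCase)

// Resolving "db1.t1.i1" is one hash lookup, then a suffix match on
// the qualifier (case-insensitive).
def lookup(nameParts: Seq[String]): Seq[Attr] = {
  val qualifier = nameParts.dropRight(1).map(_.toLowerCase)
  byName.getOrElse(nameParts.last.toLowerCase, Nil)
    .filter(a => a.qualifier.map(_.toLowerCase).endsWith(qualifier))
}

lookup(Seq("db1", "t1", "i1"))  // finds the i1 attribute
lookup(Seq("db2", "t1", "i1"))  // empty: qualifier mismatch
```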
retest this please
retest this please
do we need to consider if qualifier is Nil?
Sure. Will add a defensive check.
we may need to check the SQL standard for how to resolve the ambiguity. For a.b.c, it's possible that a is a db name, b is a table name, and c is a column name; if there is no column named c, we should fail. It's also possible that a is a table name, b is a column name, and c is a nested field name, and it exists. What's the expected behavior here?
In this case, if a.b.c fails to resolve as db.table.column, then we check if there is a table and column that matches a.b, and then see if c is a nested field name; if it exists, it will resolve to the nested field.
Tests with struct nested fields are here.
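A rough sketch of that fallback order under simplified assumptions (Attr and the helper names here are hypothetical, not Spark's real API): try the longest qualifier-plus-column interpretation first, then peel trailing parts off as nested struct field accessors.

```scala
// Simplified stand-in for AttributeReference.
case class Attr(qualifier: Seq[String], name: String)

// Returns the matched attribute and any leftover parts, which are
// then treated as nested struct field accessors.
def resolve(parts: Seq[String], attrs: Seq[Attr]): Option[(Attr, Seq[String])] = {
  (parts.length to 1 by -1).view.flatMap { i =>
    val (name, nestedFields) = parts.splitAt(i)
    attrs.find { a =>
      a.name.equalsIgnoreCase(name.last) &&
        a.qualifier.map(_.toLowerCase).endsWith(name.dropRight(1).map(_.toLowerCase))
    }.map(a => (a, nestedFields))
  }.headOption
}

// a.b.c: no column c under qualifier a.b, so fall back to column a.b
// with c interpreted as a nested field.
val attrs = Seq(Attr(Seq("db1", "a"), "b"))
resolve(Seq("a", "b", "c"), attrs)  // Some((Attr(Seq("db1","a"),"b"), Seq("c")))
```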
Is this behavior defined by the SQL standard? IIRC the current behavior without your patch is that col.innerField1.innerField2 can fail even if we have a table named col and a column named innerField1 which has an innerField2 field.
Actually it won't fail without this patch either: col.innerField1.innerField2 will resolve correctly to the struct field if there is a table named col and a column named innerField1 which has an innerField2 field.
Please see an example of such a scenario in this test output on master Spark:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/columnresolution.sql.out#L360
Let me look into the SQL standard and get back on that.
What I'm talking about is ambiguity. col.innerField1.innerField2 can fail if a table has a col column, and the col column has an innerField1 field, but innerField2 doesn't exist. My question is: shall we try all the possible resolution paths and pick the valid one, or define a rule that decides the resolution path ahead of time?
@cloud-fan, thank you for your suggestion and question.
Existing Spark behavior follows precedence rules in the column resolution logic, and in this patch we follow the same pattern/rule.
I am looking into the SQL standard to see if there are any column resolution rules, but I have not found any yet. However, when I researched existing databases, I observed different behaviors among them; these are listed in Section 2/Table A in the design doc.
I agree we can improve upon the checks in the existing precedence to go all the way and ensure there is a nested field, although the user can always qualify the field to resolve the ambiguity. Shall we open another issue to discuss and improve the existing resolution logic?
instead of creating AliasIdentifier manually, we can also define a SubqueryAlias.apply which takes a db name and table name.
Sure, let me do that if this style is preferred. Thanks.
Test build #94206 has finished for PR 17185 at commit
Changes to resolve fully qualified column attribute references from a logical plan node's children
…rt name and remove my todos
I have addressed the review comments in this commit here. @cloud-fan, please take a look. Thanks.
Test build #94234 has finished for PR 17185 at commit
retest this please
Test build #94244 has finished for PR 17185 at commit
override def sql: String = {
-  val qualifierPrefix = qualifier.map(_ + ".").getOrElse("")
+  val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") + "." else ""
nit: val qualifierPrefix = qualifier.mkString("", ".", ".")
This won't work when we have Seq.empty: the suffix "." gets returned even for an empty sequence. For a non-empty Seq, the above call is fine. Shall we leave the 'if' as is, or is there an equivalent preferred style that would work?
ah my bad, I thought it would return an empty string for an empty seq. Let's leave it.
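The behavior being discussed, verifiable in a Scala REPL: mkString(start, sep, end) always emits start and end, even for an empty collection.

```scala
val q = Seq("db1", "t1")
q.mkString("", ".", ".")                  // "db1.t1."
Seq.empty[String].mkString("", ".", ".")  // "." -- start + end survive the empty case

// hence the explicit guard in the patch:
val prefix = if (q.nonEmpty) q.mkString(".") + "." else ""
```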
val key = (qualifier.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT))
val attributes = collectMatches(name, qualified.get(key)).filter { a =>
  resolver(qualifier, a.qualifier.get)
matches = matches match {
nit:
if (matches._1.isEmpty) {
  matches = ...
}
def apply(
  identifier: String,
  database: Option[String],
why is it an Option? if there is no database name, we should call the apply method above
good point! I'll take care of this in the next push.
LGTM except some code style comments, thanks for working on it!
Thanks for the review. I have addressed your comments and pushed the changes.
Test build #94335 has finished for PR 17185 at commit
retest this please
Test build #94353 has finished for PR 17185 at commit
thanks, merging to master!
@skambha @dilipbiswal How about the hint resolution after supporting multi part names?
Let us discuss it in the JIRA: https://issues.apache.org/jira/browse/SPARK-25121
Good idea @gatorsmile. Thanks!
What changes were proposed in this pull request?
The design details are attached to the JIRA issue here.
High-level overview of the changes:
How was this patch tested?